feat: migrate control plane to vercel webhook and consolidate agents on cloud runs #400

Merged
captainsafia merged 52 commits into main from oz-agent/control-plane-migration
Apr 30, 2026

Conversation

@captainsafia
Collaborator

@captainsafia captainsafia commented Apr 28, 2026

Summary

This PR migrates the triage / respond-to-triaged / PR-review workflows off Docker and onto Warp-hosted cloud agent runs, and lays down a Vercel-based control plane that will eventually replace the GitHub Actions delivery surface entirely.

The cloud-mode rewrite is the immediately active change: each of the three Python entrypoints now calls run_agent (cloud) + build_agent_config(role="review-triage") and reads results through the new oz_workflows.artifacts.load_*_artifact helpers. Their prompts now describe an oz artifact upload <name>.json handoff instead of a /mnt/output mount; security rules, output schemas, and skill references survive the rewrite verbatim. The Docker assets (docker/triage/, docker/review/, build-{triage,review}-image composite actions, docker_agent.py, test_docker_agent.py) are deleted, and the three workflow YAMLs that used them are updated to drop the Build … agent container step and forward WARP_ENVIRONMENT_ID + WARP_REVIEW_TRIAGE_ENVIRONMENT_ID to the script step. The workflow YAMLs themselves are intentionally retained so the existing GitHub Actions delivery path keeps working through the cutover.

The new control-plane/ Python project is the long-term target: a Vercel webhook handler that verifies HMAC-SHA256 signatures and routes events to a workflow handler, plus a 1-minute cron poller that reads in-flight run state from Vercel KV and applies completed cloud-agent results back to GitHub. Lib helpers cover signatures, routing, trust evaluation, dispatch, in-flight state, GitHub App token exchange, and the cron drain loop. The full architecture and deployment runbook live in control-plane/README.md.
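For readers unfamiliar with the signature check mentioned above, here is a minimal sketch of HMAC-SHA256 verification for a GitHub webhook delivery. GitHub sends the digest in the X-Hub-Signature-256 header as `sha256=<hex>`; the function name `verify_signature` is hypothetical and is not taken from the control-plane code.

```python
from __future__ import annotations

import hashlib
import hmac


def verify_signature(secret: str, body: bytes, signature_header: str | None) -> bool:
    """Return True when the header matches the HMAC-SHA256 of the raw request body."""
    if not signature_header or not signature_header.startswith("sha256="):
        return False
    digest = hmac.new(secret.encode("utf-8"), body, hashlib.sha256).hexdigest()
    expected = f"sha256={digest}".encode("utf-8")
    # compare_digest runs in constant time, so a mismatch does not leak
    # how many leading characters were correct.
    return hmac.compare_digest(expected, signature_header.encode("utf-8"))
```

The check must run against the raw bytes of the request body, before any JSON parsing, because re-serializing the payload can change the bytes and therefore the digest.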

Workflows served by the Vercel webhook in this PR

The webhook + cron control plane now owns the following delivery surface, and the corresponding .github/workflows/*.yml + .github/scripts/*.py shims have been removed in favor of the cloud-mode helpers under lib/scripts/:

  • review-pull-request (PR opened / ready_for_review / review_requested / labeled / /oz-review).
  • enforce-pr-issue-state (PR synchronize / edited).
  • respond-to-pr-comment (@oz-agent mention on PRs and review threads).
  • verify-pr-comment (/oz-verify on PRs).
  • triage-new-issues — newly added: issues.opened on non-triaged issues, @oz-agent mention on a non-triaged issue, and needs-info reporter replies all route through the webhook.

@oz-agent mentions on already-triaged issues continue to flow through the legacy respond-to-triaged-issue-comment GitHub Actions workflow until that workflow is migrated in a follow-up.
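As a rough illustration of the delivery surface listed above, the following sketch maps a GitHub event name and payload to a workflow id. It is not the contents of lib/routing.py: the function name `route_event`, the `triaged` label used to detect triaged issues, and the prefix matching on comment bodies are all assumptions, and the needs-info reporter-reply and review-thread paths are left out for brevity.

```python
from __future__ import annotations

from typing import Any

REVIEW_ACTIONS = {"opened", "ready_for_review", "review_requested"}
ENFORCE_ACTIONS = {"synchronize", "edited"}


def _is_triaged(issue: dict[str, Any]) -> bool:
    # Assumption: a "triaged" label marks an issue as already triaged.
    return any(label.get("name") == "triaged" for label in issue.get("labels", []))


def route_event(event: str, payload: dict[str, Any]) -> str | None:
    """Return a workflow id for a GitHub delivery, or None when nothing matches."""
    action = payload.get("action")
    issue = payload.get("issue") or {}
    body = ((payload.get("comment") or {}).get("body") or "").strip()

    if event == "pull_request":
        if action in REVIEW_ACTIONS:
            return "review-pull-request"
        if action == "labeled" and (payload.get("label") or {}).get("name") == "oz-review":
            return "review-pull-request"
        if action in ENFORCE_ACTIONS:
            return "enforce-pr-issue-state"
        return None

    if event == "issues" and action == "opened":
        return None if _is_triaged(issue) else "triage-new-issues"

    if event == "issue_comment" and action == "created":
        if "pull_request" in issue:  # GitHub sets this key when the "issue" is a PR
            if body.startswith("/oz-review"):
                return "review-pull-request"
            if body.startswith("/oz-verify"):
                return "verify-pr-comment"
            return "respond-to-pr-comment" if "@oz-agent" in body else None
        if "@oz-agent" in body and not _is_triaged(issue):
            return "triage-new-issues"
    return None
```

Returning None for a triaged-issue mention mirrors the note above that those mentions stay on the legacy GitHub Actions workflow for now.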

Cutover steps after merge

  1. Provision the Vercel project — point a new Python project at control-plane/. vercel.json declares the runtime, both functions, and the 1-minute cron schedule.
  2. Set Vercel project secrets: OZ_GITHUB_WEBHOOK_SECRET, OZ_GITHUB_APP_ID, OZ_GITHUB_APP_PRIVATE_KEY, WARP_API_KEY, WARP_API_BASE_URL, WARP_ENVIRONMENT_ID, WARP_REVIEW_TRIAGE_ENVIRONMENT_ID, CRON_SECRET. Per-secret details are in control-plane/README.md.
  3. Provision Vercel KV — add a KV resource to the Vercel project. The cron handler imports vercel_kv lazily.
  4. Update the GitHub App webhook URL — flip the App's webhook URL from the GitHub Actions delivery target to https://<project>.vercel.app/api/webhook. The webhook handler returns 202 with the routed workflow id so the Recent Deliveries UI stays green.
  5. Open the follow-up PR to delete .github/workflows/* per plan §5a — once the Vercel control plane is verified end-to-end, delete the remaining legacy GitHub Actions YAMLs in a follow-up PR. This PR keeps the issue-triggered helpers (respond-to-triaged-issue-comment, create-spec-from-issue, create-implementation-from-issue, comment-on-*) and the plan-approval workflows in place so both delivery paths can be exercised in parallel during cutover.
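Before flipping the webhook URL in step 4, it can help to confirm that the secrets from step 2 are actually present in the deployed environment. The sketch below is a hypothetical preflight check, not part of this PR; it assumes every listed variable is mandatory, which control-plane/README.md may contradict for some of them.

```python
from __future__ import annotations

import os
from collections.abc import Mapping

# Names copied from cutover step 2. Treating all of them as required is an assumption.
REQUIRED_SECRETS = (
    "OZ_GITHUB_WEBHOOK_SECRET",
    "OZ_GITHUB_APP_ID",
    "OZ_GITHUB_APP_PRIVATE_KEY",
    "WARP_API_KEY",
    "WARP_API_BASE_URL",
    "WARP_ENVIRONMENT_ID",
    "WARP_REVIEW_TRIAGE_ENVIRONMENT_ID",
    "CRON_SECRET",
)


def missing_secrets(env: Mapping[str, str] | None = None) -> list[str]:
    """Return the required variable names that are unset or blank, in declaration order."""
    source = os.environ if env is None else env
    return [name for name in REQUIRED_SECRETS if not source.get(name, "").strip()]
```

Calling `missing_secrets()` with no argument inspects the current process environment; passing a mapping makes the check easy to unit test.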

Validation

  • python -m pytest tests → 191 tests passed, 47 subtests passed (signature verification, routing table, trust evaluation, dispatch, cron drain loop, builder lifecycle, handler wiring, triage prompt + apply helpers).
  • PYTHONPATH=lib:.github/scripts python -m unittest discover -s .github/scripts/tests → 294 OK (helpers, role parameter, named artifact loaders, cloud-mode triage/review/respond-to-triaged dispatch, and skill-section assertions; the triage-specific tests have moved to tests/test_triage.py).
  • Triage / review / respond-to-triaged Security Rules: blocks were diff-checked byte-for-byte against main to confirm the cloud rewrite did not weaken or relax the prompt-injection / output-schema rules.

References

Plan id: 7e8e8b6a-9e8a-4cbf-ab95-dd37cf4cc44c.

Conversation: https://staging.warp.dev/conversation/da08a6c7-4f86-4dac-99f6-3358cfe3258e
Run: https://oz.staging.warp.dev/runs/019dd6d9-cb65-73a2-8cfb-d03d37afd03a
Plans:

This PR was generated with Oz.

@vercel

vercel Bot commented Apr 28, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| oz-for-oss | Ready | Preview, Comment | Apr 30, 2026 5:36pm |

Collaborator Author

Round 2: Vercel control plane wiring

This round adds the control-plane wiring that turns the scaffolding from round 1 into an end-to-end PR-flow runtime. Every PR-triggered webhook now flows through the Vercel control plane (webhook → builder → cloud-agent dispatch → KV → cron → apply-result-to-GitHub).

Commits

  • 787e22f feat(oz_workflows): add fire-and-forget dispatch_run helper — adds an optional client: OzAPI | None = None parameter to dispatch_run so the cron poller / webhook handler can reuse a single SDK client. Rebased on top of e41e256.
  • 14275df chore(control-plane): mirror oz_workflows + entrypoints into the function — Vercel installCommand (scripts/vercel_install.sh) mirrors .github/scripts/oz_workflows/ and the four PR entrypoints into control-plane/lib/ before the build step. Mirrored copies are git-ignored so .github/scripts/ stays the single source of truth.
  • 5d8d0e3 refactor(workflows): expose gather/build/apply helpers on PR entrypoints — refactors review_pr.py, respond_to_pr_comment.py, verify_pr_comment.py, and enforce_pr_issue_state.py to expose gather_*_context / build_*_prompt / apply_*_result (plus enforce_pr_state_synchronously for the deterministic enforce-PR path). The legacy main() entrypoints continue to call the same helpers so the GitHub Actions path stays byte-for-byte equivalent.
  • 381a11d feat(control-plane): wire builders, handlers, webhook, and cron registry
    • lib/builders.py: one PromptBuilder per cloud workflow + build_builder_registry.
    • lib/handlers.py: one WorkflowHandlers per cloud workflow + build_handler_registry. Each handler mints a fresh GitHub App-installation token per call.
    • api/webhook.py:process_webhook_request now signature-verifies, routes, runs the synchronous enforce-pr-issue-state path inline, evaluates the route through the builder registry, dispatches via dispatch_run, persists RunState to KV, and returns 202 with {run_id, dispatched, ...}. Errors at any stage surface as a structured 500.
    • api/cron.py:build_workflow_handlers returns the concrete handler registry instead of an empty {}.
  • 8dc2ea5 test(control-plane): add builders + handlers + webhook dispatch unit tests — adds 23 new tests across tests/test_builders.py, tests/test_handlers.py, and tests/test_webhook_dispatch.py. Updated control-plane/README.md to drop the "currently no-op" caveat and document which workflows are live (PR-flow) vs. still pending (issue-flow).
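The request flow described for api/webhook.py can be sketched roughly as follows. This is an illustration under stated assumptions, not the real process_webhook_request: `handle_delivery`, `Response`, the injected callables, the plain dict standing in for Vercel KV, and the 401 on a bad signature are all hypothetical, and the inline enforce-pr-issue-state path is omitted.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class Response:
    status: int
    body: dict[str, Any] = field(default_factory=dict)


def handle_delivery(
    payload: dict[str, Any],
    *,
    signature_ok: bool,
    route: Callable[[dict[str, Any]], str | None],
    builders: dict[str, Callable[[dict[str, Any]], str]],
    dispatch: Callable[[str, str], str],
    kv: dict[str, dict[str, Any]],
) -> Response:
    """Verify, route, build a prompt, dispatch a run, and persist its state."""
    if not signature_ok:
        # Assumption: the source does not say which status a bad signature gets.
        return Response(401, {"error": "invalid signature"})
    try:
        workflow = route(payload)
        builder = builders.get(workflow) if workflow else None
        if builder is None:
            # Routed but no builder registered: acknowledge without dispatching.
            return Response(202, {"workflow": workflow, "dispatched": False})
        run_id = dispatch(workflow, builder(payload))
        kv[run_id] = {"workflow": workflow, "payload_subset": payload}
        return Response(202, {"workflow": workflow, "run_id": run_id, "dispatched": True})
    except Exception as exc:  # any stage failing becomes a structured 500
        return Response(500, {"error": type(exc).__name__, "detail": str(exc)})
```

Passing the collaborators in as arguments keeps the sketch testable without network access, which is also how the registry-miss case (a routed workflow with no builder) can be exercised.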

What's live

  • review-pull-request (PR opens, ready_for_review, oz-review label, /oz-review command).
  • respond-to-pr-comment (@oz-agent mentions on PR conversation comments, review comments, and review bodies).
  • verify-pr-comment (/oz-verify command).
  • enforce-pr-issue-state (PR synchronize / edited). Synchronous allow/close decisions run inline in the webhook; the need-cloud-match branch dispatches a cloud agent run.

The issue-triggered workflows and the plan-approval workflows are still routed by lib/routing.py but ignored at dispatch time. They keep flowing through the legacy GitHub Actions paths until a follow-up PR adds their builders / handlers.

Validation

  • PYTHONPATH=.github/scripts python -m unittest discover -s .github/scripts/tests → 494 OK
  • python -m pytest control-plane/tests → 110 passed (up from 87)
  • python -m compileall control-plane/api control-plane/lib .github/scripts/oz_workflows → clean

Operator notes

scripts/vercel_install.sh is now invoked as the project's installCommand. New env vars on the Vercel project:

  • OZ_GITHUB_APP_ID — App's numeric ID.
  • OZ_GITHUB_APP_PRIVATE_KEY — App's PEM-encoded private key.
  • GITHUB_API_BASE_URL (optional) — defaults to https://api.github.com. Override for GitHub Enterprise.

The new control-plane/README.md documents the full env-var matrix and the cutover steps.

Out of scope

  • Issue-triggered workflows (triage-new-issues, respond-to-triaged-issue-comment, create-spec-from-issue, create-implementation-from-issue).
  • trigger_implementation_on_plan_approved.py and remove_stale_issue_labels_on_plan_approved.py.
  • Deleting any GitHub Actions workflow YAMLs.

The legacy GitHub Actions paths still pass all tests and continue to handle every workflow during the cutover.

@oz-for-oss
Contributor

oz-for-oss Bot commented Apr 28, 2026

@captainsafia

I'm running /oz-verify for this pull request using the repository's verification-enabled skills.

I couldn't run /oz-verify because this repository does not currently expose any skills with metadata.verification: true under .agents/skills/.

Powered by Oz

@oz-for-oss
Contributor

oz-for-oss Bot commented Apr 28, 2026

@captainsafia

I'm working on changes requested in this PR (responding to a PR conversation comment).

You can view the conversation on Warp.

I pushed changes to this PR based on the comment.

Next steps:

  • Review the changes pushed to this PR.
  • Follow up with another comment if further adjustments are needed.

Powered by Oz

Collaborator Author

Fix WorkflowProgressComment lifecycle regression

Operators noticed that real PR webhook deliveries dispatched a cloud agent run successfully but never produced the user-visible "Oz is starting to review this pull request..." progress comment, so there was no comment to update with the session-share link or to replace with the final review body. This change wires the WorkflowProgressComment lifecycle across the webhook → cron seam so the comment now appears at dispatch and is driven all the way through to completion.

What changed

be520ed — foundation in canonical oz_workflows + entrypoints

  • WorkflowProgressComment.__init__ now accepts optional comment_id, run_id, oz_run_id, session_link kwargs so callers can rebuild an instance bound to the comment posted at dispatch time.
  • apply_review_result / apply_pr_comment_result / apply_verification_result / apply_issue_association_result each accept an optional progress kwarg. When provided, the helper reuses it instead of constructing a fresh one. The legacy GitHub Actions runtime contract is preserved.

e9b23b7 — builders post the "starting..." comment before dispatch

  • Each builder now constructs a WorkflowProgressComment, posts the workflow-specific opening line via progress.start(...), and stashes progress_comment_id + progress_run_id into DispatchRequest.payload_subset so the cron poller can reconstruct the same comment.
  • The review/respond/verify builders post directly via a new _start_progress_comment helper. The enforce builder hands its progress instance to enforce_pr_state_synchronously, which already drives the start line for the need-cloud-match branch.
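The hand-off across the webhook → cron seam amounts to serializing two identifiers at dispatch and reading them back at poll time. A minimal sketch, assuming a hypothetical `ProgressRef` stand-in for the identifying fields of WorkflowProgressComment (the real class takes more arguments than shown here):

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any


@dataclass
class ProgressRef:
    """Hypothetical stand-in for the fields needed to rebind to a posted comment."""
    comment_id: int
    run_id: str


def stash_progress(payload_subset: dict[str, Any], ref: ProgressRef) -> dict[str, Any]:
    """Builder side: copy the payload and record which comment was posted."""
    return {
        **payload_subset,
        "progress_comment_id": ref.comment_id,
        "progress_run_id": ref.run_id,
    }


def rebuild_progress(payload_subset: dict[str, Any]) -> ProgressRef | None:
    """Cron side: rebuild the reference, or None for runs dispatched without one."""
    comment_id = payload_subset.get("progress_comment_id")
    run_id = payload_subset.get("progress_run_id")
    if comment_id is None or run_id is None:
        return None
    return ProgressRef(comment_id=int(comment_id), run_id=str(run_id))
```

Returning None when the keys are absent lets the apply helpers fall back to constructing a fresh comment, which matches the optional `progress` kwarg described for be520ed.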

8592b5f — cron drives the rest of the lifecycle

  • New non_terminal_handler protocol on WorkflowHandlers: fires on every pending poll. The poller invokes it with the current run, and each PR-flow handler reconstructs the WorkflowProgressComment from the persisted payload_subset and calls record_run_session_link(progress, run) so the session-share link surfaces in the comment as soon as Oz reports it. Failures are absorbed.
  • Each result_applier reconstructs the same progress instance and passes it into the workflow-specific apply_*_result so the final progress.complete / progress.replace_body edits the original comment.
  • Each failure_handler similarly reconstructs the progress and calls progress.report_error(), replacing the in-flight progress comment with the workflow-error message instead of orphaning it.
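The three hooks above fit together in a poll loop along these lines. This is a sketch of the contract only: the `Handlers` dataclass, the `drain` function, the dict used in place of Vercel KV, and the `succeeded` / `failed` status strings are assumptions rather than the actual poller.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any, Callable

Hook = Callable[[dict[str, Any], dict[str, Any]], None]


@dataclass
class Handlers:
    result_applier: Hook
    failure_handler: Hook
    non_terminal_handler: Hook | None = None


def drain(
    in_flight: dict[str, dict[str, Any]],
    fetch_run: Callable[[str], dict[str, Any]],
    handlers: dict[str, Handlers],
) -> list[str]:
    """Poll every in-flight run once and return the ids that reached a terminal state."""
    finished: list[str] = []
    for run_id, state in list(in_flight.items()):
        handler = handlers.get(state["workflow"])
        if handler is None:
            continue
        run = fetch_run(run_id)
        status = run.get("status")  # status names are assumed for illustration
        if status == "succeeded":
            handler.result_applier(run, state)
            finished.append(run_id)
        elif status == "failed":
            handler.failure_handler(run, state)
            finished.append(run_id)
        elif handler.non_terminal_handler is not None:
            try:
                handler.non_terminal_handler(run, state)
            except Exception:
                # Absorbed: a failed comment update must not stall the drain loop.
                pass
    for run_id in finished:
        in_flight.pop(run_id, None)
    return finished
```

Only the non-terminal hook swallows exceptions here; whether the real poller also guards the terminal hooks is not stated in this PR.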

7d078d1 — vendored mirror refresh

  • Ran bash control-plane/scripts/vercel_install.sh to mirror the canonical changes into control-plane/lib/oz_workflows/ and control-plane/lib/scripts/. Vercel ships only the committed mirror, so the install script's output is committed as part of this stack.

Tests

  • control-plane/tests/test_builders.py: each builder asserts WorkflowProgressComment(...).start(...) was called and that progress_comment_id / progress_run_id land in payload_subset.
  • control-plane/tests/test_handlers.py: each result_applier asserts the reconstructed progress is forwarded to apply_*_result(progress=...). failure_handler asserts progress.report_error() is invoked. New non_terminal_handler test asserts record_run_session_link(progress, run) is called on the rebuilt instance.
  • control-plane/tests/test_poll_runs.py: new tests for the non-terminal hook, including the failure-absorption contract.

Validation gate

| Suite | Result |
| --- | --- |
| python -m pytest control-plane/tests | 120 passed, 14 subtests passed |
| PYTHONPATH=.github/scripts python -m unittest discover -s .github/scripts/tests | 494 passed |
| python -m compileall control-plane/api control-plane/lib .github/scripts/oz_workflows | clean |

Collaborator Author

Pushed 03a7e63 to fix the failing CI tests, and opened #401 to fix the legacy run_tests_on_push / Run tests check that runs from main's pr-hooks.yml.

What was failing

  1. Run tests (this PR's new workflow): python -m pytest tests failed with No module named pytest. The new requirements.txt is intentionally scoped to runtime deps for the Vercel function bundle and does not include pytest / pytest-subtests.
  2. run_tests_on_push / Run tests (reusable from main): python -m unittest discover -s .github/scripts/tests fails with ModuleNotFoundError: No module named 'oz_workflows' because this PR moved oz_workflows from .github/scripts/oz_workflows to lib/oz_workflows, but the reusable workflow on main still exports PYTHONPATH=.github/scripts.
  3. Vercel — the deployment fails ~4s after starting, before any build output. See remaining issue below.

Fixes

  • In this PR (03a7e63): install pytest>=8,<9 and pytest-subtests>=0.13,<1 as a separate workflow step, and document the same install in CONTRIBUTING.md. The new Run tests check now passes (110 passed, 14 subtests passed).
  • In #401 (fix(ci): include lib in PYTHONPATH for legacy run-tests, against main): extend the legacy reusable run-tests.yml's PYTHONPATH to lib:.github/scripts. Non-existent PYTHONPATH entries are ignored, so PRs that haven't migrated yet are unaffected. After #401 lands, the run_tests_on_push / Run tests check on this PR should go green on the next push.
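The claim that a missing PYTHONPATH entry is harmless can be checked locally. The sketch below builds a throwaway tree that has a .github/scripts directory but no lib directory (the un-migrated layout) and imports a module through a PYTHONPATH that lists both; the module name `oz_workflows_demo` and the helper names are made up for the demonstration.

```python
from __future__ import annotations

import os
import subprocess
import sys
import tempfile
from pathlib import Path


def import_with_pythonpath(entries: list[str], module: str) -> str:
    """Import `module` in a child interpreter whose PYTHONPATH is `entries`."""
    env = dict(os.environ, PYTHONPATH=os.pathsep.join(entries))
    result = subprocess.run(
        [sys.executable, "-c", f"import {module}; print({module}.__name__)"],
        env=env,
        capture_output=True,
        text=True,
        check=True,  # raises CalledProcessError if the import fails
    )
    return result.stdout.strip()


def demo() -> str:
    with tempfile.TemporaryDirectory() as tmp:
        legacy = Path(tmp) / ".github" / "scripts"
        legacy.mkdir(parents=True)
        (legacy / "oz_workflows_demo.py").write_text("VALUE = 1\n")
        missing = str(Path(tmp) / "lib")  # deliberately never created
        return import_with_pythonpath([missing, str(legacy)], "oz_workflows_demo")
```

If the missing `lib` entry caused a problem, `subprocess.run` would raise because of `check=True`; instead the import resolves from the second entry.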

Vercel deployment — likely needs project-side change

The Vercel deployment fails almost immediately, which usually means Vercel rejected the build before running any commands. The previous control-plane/vercel.json and the new root vercel.json are byte-for-byte identical aside from their location, so the relocation itself looks fine.

The most likely cause is the Vercel project's Root Directory setting (Project Settings → General → Root Directory) — if it's still pinned to control-plane, that path no longer exists on this branch and the deployment will fail before doing anything. Updating that setting to the repo root (or removing it) should unblock the deploy. I don't have access to the Vercel dashboard to verify or change this; happy to follow up if you confirm the setting.

Co-Authored-By: Oz <oz-agent@warp.dev>
Rename the shared oz_workflows package to oz and move executable workflow modules from lib/scripts into lib/workflows. Update imports, documentation, specs, and tests to use the new package layout.

Co-Authored-By: Oz <oz-agent@warp.dev>
Move the control-plane package from lib to core and update runtime imports, PYTHONPATH configuration, docs, and tests to match the new package layout.

Co-Authored-By: Oz <oz-agent@warp.dev>
@captainsafia captainsafia marked this pull request as ready for review April 30, 2026 15:49
@oz-for-oss
Contributor

oz-for-oss Bot commented Apr 30, 2026

@captainsafia

I'm starting a first review of this pull request.

You can follow along in the session on Warp.

I completed the review and posted feedback on this pull request.

Comment /oz-review on this pull request to retrigger a review (up to 3 times on the same pull request).

Powered by Oz

Contributor

@oz-for-oss oz-for-oss Bot left a comment

Overview

This PR replaces the legacy GitHub Actions/Docker workflow surface with a Vercel webhook and cron control plane, cloud-run workflow adapters, and artifact-based handoff for review, triage, spec, and implementation paths.

Concerns

  • The implementation apply path accepts any agent-reported branch that merely starts with the expected target branch, which can select a sibling issue branch.
  • The create-spec and create-implementation apply paths lost the previous one-minute timestamp cushion when checking whether the agent pushed changes, so successful cloud runs can be treated as no-ops under normal GitHub/Oz clock skew.
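The timestamp concern can be illustrated with a small comparison helper. This is a sketch of the idea only; `branch_updated_since` and its parameters are hypothetical names, not the code under review.

```python
from __future__ import annotations

from datetime import datetime, timedelta, timezone

# One-minute tolerance for clock skew between the two systems being compared.
PUSH_CUSHION = timedelta(minutes=1)


def branch_updated_since(
    run_started_at: datetime,
    branch_pushed_at: datetime,
    cushion: timedelta = PUSH_CUSHION,
) -> bool:
    """Treat a push as belonging to the run if it is no older than start minus cushion."""
    return branch_pushed_at >= run_started_at - cushion
```

With a zero cushion, a push stamped a few seconds before the recorded run start would be classified as a no-op even though the agent produced it.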

Security

  • The branch override is derived from an agent-produced artifact after the agent has read untrusted issue content; constrain it to the exact target branch or a delimiter-bounded slug before using it to update or open PRs.
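One way to express the suggested constraint is an exact match or a match followed by a delimiter and a non-empty suffix. The sketch below is illustrative; `is_allowed_branch`, the delimiter set, and the branch names in the usage note are assumptions.

```python
def is_allowed_branch(candidate: str, expected: str, delimiters: str = "-/") -> bool:
    """Accept the expected branch itself, or expected + delimiter + non-empty slug."""
    if not expected:
        return False
    if candidate == expected:
        return True
    if not candidate.startswith(expected):
        return False
    suffix = candidate[len(expected):]
    # A bare prefix match is not enough: "issue-12" must not admit "issue-123".
    return len(suffix) > 1 and suffix[0] in delimiters
```

For an expected branch of `oz-agent/implement-issue-12`, this accepts `oz-agent/implement-issue-12-fix-typo` and rejects the sibling `oz-agent/implement-issue-123`, which a plain `startswith` check would let through.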

Verdict

Found: 0 critical, 3 important, 0 suggestions

Request changes

Comment /oz-review on this pull request to retrigger a review (up to 3 times on the same pull request).

Powered by Oz

Comment thread core/workflows/create_implementation_from_issue.py Outdated
Comment thread core/workflows/create_implementation_from_issue.py Outdated
Comment thread core/workflows/create_spec_from_issue.py Outdated
Constrain implementation branch overrides to the expected branch or a delimiter-bounded suffix and restore the one-minute timestamp cushion when checking agent-pushed spec and implementation branches.

Co-Authored-By: Oz <oz-agent@warp.dev>
@captainsafia captainsafia merged commit 6a5ac7c into main Apr 30, 2026
11 of 12 checks passed
@captainsafia captainsafia deleted the oz-agent/control-plane-migration branch May 1, 2026 01:33
sebryu added a commit to sebryu/oz-for-oss that referenced this pull request May 1, 2026
…rkflows

Addresses Oz's CHANGES_REQUESTED on this PR: the five workflow files
restored in the previous commit reference composite actions
(`build-triage-image`, `run-oz-python-script`) and Python entrypoints
that were also deleted by PR warpdotdev#400. Without these, the workflows would
still fail at action resolution before any script could run.

This commit restores the dependency closure for those five workflows,
all from commit 6ffca63 (the parent of
the deletion commit 6a5ac7c), scoped to what the legacy issue-triggered
helpers actually need:

Composite actions:
- .github/actions/build-triage-image/action.yml
- .github/actions/run-oz-python-script/action.yml

Python entrypoints (one per restored workflow):
- .github/scripts/respond_to_triaged_issue_comment.py
- .github/scripts/comment_on_unready_assigned_issue.py
- .github/scripts/update_dedupe.py
- .github/scripts/update_pr_review.py
- .github/scripts/update_triage.py

Shared library used by the entrypoints:
- .github/scripts/oz_workflows/{__init__,actions,artifacts,docker_agent,env,helpers,oz_client,repo_local,triage,verification,workflow_config,workflow_paths}.py
- .github/scripts/requirements.txt

Triage container (built by `build-triage-image`, run by docker_agent.py):
- docker/triage/{Dockerfile,README.md,entrypoint.sh}
- uv.toml

Intentionally left deleted, because PR warpdotdev#400 explicitly migrated them to
the Vercel webhook control plane:

- review-pull-request workflow + review_pr.py + build-review-image action + docker/review/
- enforce-pr-issue-state + enforce_pr_issue_state.py
- respond-to-pr-comment + respond_to_pr_comment.py
- verify-pr-comment + verify_pr_comment.py
- triage-new-issues + triage_new_issues.py
- resolve_review_context.py
- All test files (not strictly needed for runtime)

Verification:
- Grepped all restored Python files for imports; every `oz_workflows.*`
  module referenced is included in this restore.
- Grepped all restored YAML for `uses: warpdotdev/oz-for-oss/...`; only
  `build-triage-image` and `run-oz-python-script` are referenced and
  both are restored.

Refs: warpdotdev#418
captainsafia pushed a commit to warpdotdev/warp that referenced this pull request May 1, 2026
…9843)

## Summary

Removes two GitHub Actions adapter workflows that delegate to reusable
workflows in `warpdotdev/oz-for-oss` that no longer exist:

- `.github/workflows/respond-to-triaged-issue-comment-local.yml`
- `.github/workflows/comment-on-unready-assigned-issue-local.yml`

Both have been failing on every trigger since 2026-04-30, when
[`warpdotdev/oz-for-oss#400`](warpdotdev/oz-for-oss#400)
deleted the upstream targets as part of migrating to a Vercel webhook
control plane.

## Context

The [`oz-for-oss` maintainer
confirmed](warpdotdev/oz-for-oss#418 (comment))
the upstream deletions were intentional and asked us to remove the
warp-side adapters rather than restoring upstream. Per-workflow
disposition (her words):

- `respond-to-triaged-issue-comment` — "covered by the issue_comment
hook" of the new Vercel webhook.
- `comment-on-unready-assigned-issue` — "currently being rewired"
upstream.

Root-cause issue:
[`warpdotdev/oz-for-oss#418`](warpdotdev/oz-for-oss#418).

## Verification

Empirical confirmation that the new webhook is live and handling
`@oz-agent` mentions on triaged warp issues — sub-15s response latency
from the `oz-for-oss` bot, well below GHA cold-start times:

- Issue #8642 (2026-05-01 15:24 UTC): `@oz-agent` → bot reply 13s later
- Issue #9576 (2026-04-30 23:33 UTC): `@oz-agent` → bot reply 8s later
- Issue #9688 (2026-04-30 23:33 UTC): `@oz-agent` → bot reply 8s later

Meanwhile every fire of `respond-to-triaged-issue-comment-local.yml`
since 2026-04-30 has logged `failure` (e.g. run
[`25207261218`](https://github.com/warpdotdev/warp/actions/runs/25207261218))
— the GHA path is purely noise.

For `comment-on-unready-assigned-issue` there is a brief coverage gap
until oz-for-oss finishes rewiring the `issues.assigned` event handler.
Failing-and-noisy is worse than absent, so removing now improves signal.

## Scope note

Three additional broken adapters exist for `update-dedupe`,
`update-pr-review`, `update-triage` (weekly scheduled skill-refresh
workflows). Those are intentionally left in place pending confirmation
from `oz-for-oss` on whether the new control plane runs equivalent
scheduled jobs against warp, or whether the replacement scheduled agents
are still upcoming work. Those three have **never** executed for warp
(`gh run list` returns `[]` for each), so leaving them does not affect
any current automation.

## Test plan

- [x] No required status checks affected — verified via repo ruleset
15469325; only `Check CI results` is required.
- [x] No PRs blocked: these workflows trigger on `issue_comment` and
`issues`, never on `pull_request`.
- [x] Empirical webhook coverage verified for
`respond-to-triaged-issue-comment` (see Verification).
- [ ] Reviewer to sanity-check that removing these workflows is
acceptable given the brief coverage gap for
`comment-on-unready-assigned-issue` flagged above.

4 participants